Left-corner chart parsing

ABSTRACT

Different embodiments of the present invention provide improvements to left-corner chart parsing. The improvements include a specific order of filtering checks, transforming the grammar using bottom-up prefix merging, indexing productions first based on input symbols, grammar flattening, and annotating chart edges for the extraction of parses.

REFERENCE TO CO-PENDING APPLICATION

The present application is a continuation of and claims priority of U.S. patent application Ser. No. 10/927,360, filed Aug. 26, 2004, now U.S. Pat. No. 7,027,977, which is a divisional of and claims priority of U.S. patent application Ser. No. 09/510,020, filed Feb. 22, 2000, now U.S. Pat. No. 6,999,917, the content of which is hereby incorporated by reference in its entirety.

Reference is hereby made to co-pending U.S. patent application Ser. No. 09/441,685, entitled ELIMINATION OF LEFT RECURSION FROM CONTEXT-FREE GRAMMARS, filed on Nov. 16, 1999.

BACKGROUND OF THE INVENTION

The present invention deals with parsing text. More specifically, the present invention deals with improvements in left-corner chart parsing.

Parsing refers to the process of analyzing a text string into its component parts and categorizing those parts. This can be part of processing either artificial languages (C++, Java, HTML, XML, etc.) or natural languages (English, French, Japanese, etc.). For example, parsing the English sentence, the man with the umbrella opened the large wooden door, would normally involve recognizing that:

-   -   opened is the main verb of the sentence,
    -   the subject of opened is the noun phrase the man with the umbrella,
    -   the object of opened is the noun phrase the large wooden door,

with the man with the umbrella and the large wooden door being further analyzed into their component parts. The fact that parsing is nontrivial is illustrated by the fact that the sentence contains the substring the umbrella opened, which in isolation could be a full sentence, but in this case is not even a complete phrase of the larger sentence.

Parsing by computer is sometimes performed by a program that is specific to a particular language, but often a general-purpose parsing algorithm is used with a formal grammar for a specific language to parse strings in that language. That is, rather than having separate programs for parsing English and French, a single program is used to parse both languages, but it is supplied with a grammar of English to parse English text, and a grammar of French to parse French text.

Perhaps the most fundamental type of formal grammar is context-free grammar. A context-free grammar consists of terminal symbols, which are the tokens of the language; a set of nonterminal symbols, which are analyzed into sequences of terminals and other nonterminals; a set of productions, which specify the analyses; and a distinguished “top” nonterminal symbol, which specifies the strings that can stand alone as complete expressions of the language.

The productions of a context-free grammar can be expressed in the form A→X₁ . . . X_(n) where A is a single nonterminal symbol, and X₁ . . . X_(n) is a sequence of n terminals and/or nonterminals. The interpretation of a production A→X₁ . . . X_(n) is that a string can be categorized by the nonterminal A if it consists of a sequence of contiguous substrings that can be categorized by X₁ . . . X_(n).

The goal of parsing is to find an analysis of a string of text as an instance of the top symbol of the grammar, according to the productions of the grammar. To illustrate, suppose we have the following grammar for a tiny fragment of English:

-   -   S→NP VP
    -   NP→Name
    -   Name→john
    -   Name→mary
    -   VP→V NP
    -   V→likes

In this grammar, terminals are all lower case, nonterminals begin with an upper case letter, and S is the distinguished top symbol of the grammar. The productions can be read as saying that a sentence can consist of a noun phrase followed by a verb phrase, a noun phrase can consist of a name, john and mary can be names, a verb phrase can consist of a verb followed by a noun phrase, and likes can be a verb. It should be easy to see that the string john likes mary can be analyzed as a complete sentence of the language defined by this grammar according to the following structure:

-   -   (S: (NP: (Name: john))
            (VP: (V: likes)
                (NP: (Name: mary))))

For parsing natural language, often grammar formalisms are used that augment context-free grammar in some way, such as adding features to the nonterminal symbols of the grammar, and providing a mechanism to propagate and test the values of the features. For example, the nonterminals NP and VP might be given the feature number, which can be tested to make sure that singular subjects go with singular verbs and plural subjects go with plural verbs. Nevertheless, even natural-language parsers that use one of these more complex grammar formalisms are usually based on some extension of one of the well-known algorithms for parsing with context-free grammars.

Grammars for artificial languages, such as programming languages (C++, Java, etc.) or text mark-up languages (HTML, XML, etc.) are usually designed so that they can be parsed deterministically. That is, they are designed so that the grammatical structure of an expression can be built up one token at a time without ever having to guess how things fit together. This means that parsing can be performed very fast and is rarely a significant performance issue in processing these languages.

Natural languages, on the other hand, cannot be parsed deterministically, because it is often necessary to look far ahead before it can be determined how an earlier phrase is to be analyzed. Consider for example the two sentences:

-   -   Visiting relatives often stay too long.
    -   Visiting relatives often requires a long trip.

In the first sentence, visiting relatives refers to relatives who visit, while in the second sentence it refers to the act of paying a visit to relatives. In any reasonable grammar for English, these two instances of visiting relatives would receive different grammatical analyses. The earliest point in the sentences where this can be determined, however, is after the word often. It is hard to imagine a way to parse these sentences, such that the correct analysis could be assigned with certainty to visiting relatives before it is combined with the analysis of the rest of the sentence.

The existence of nondeterminacy in parsing natural languages means that sometimes hundreds, or even thousands, of hypotheses about the analyses of parts of a sentence must be considered before a complete parse of the entire sentence is found. Moreover, many sentences are grammatically ambiguous, having multiple parses that require additional information to choose between. In this case, it is desirable to be able to find all parses of a sentence, so that additional knowledge sources can be used later to make the final selection of the correct parse. The high degree of nondeterminacy and ambiguity in natural languages means that parsing natural language is computationally expensive, and as grammars are made more detailed in order to describe the structure of natural-language expressions more accurately, the complexity of parsing with those grammars increases. Thus in almost every application of natural-language processing, the computation time needed for parsing is a serious issue, and faster parsing algorithms are always desirable to improve performance.

“Chart parsing” or “tabular parsing” refers to a broad class of efficient parsing algorithms that build a collection of data structures representing segments of the input partially or completely analyzed as a phrase of some category in the grammar. These data structures are individually referred to as “edges” and the collection of edges derived in parsing a particular string is referred to as a “chart”. In these algorithms, efficient parsing is achieved by the use of dynamic programming, which simply means that if the same chart edge is derived in more than one way, only one copy is retained for further processing.

The present invention is directed to a set of improvements to a particular family of chart parsing algorithms referred to as “left-corner” chart parsing. Left-corner parsing algorithms are distinguished by the fact that an instance of a given production is hypothesized when an instance of the left-most symbol on the right-hand side of the production has been recognized. This symbol is sometimes called the “left corner” of the production; hence, the name of the approach. For example, if VP→V NP is a production in the grammar, and a terminal symbol of category V has been found in the input, then a left-corner parsing algorithm would consider the possibility that the V in the input should combine with a NP to its right to form a VP.

SUMMARY OF THE INVENTION

Different embodiments of the present invention provide improvements to left-corner chart parsing. The improvements include a specific order of filtering checks, transforming the grammar using bottom-up prefix merging, indexing productions first based on input symbols, grammar flattening, and annotating chart edges for the extraction of parses.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an exemplary environment in which the present invention can be implemented.

FIG. 2 is a block diagram of a left-corner chart parser.

FIGS. 3A-3C are flow diagrams illustrating the performance of a bottom-up left-corner check and a top-down left-corner check in accordance with one embodiment of the present invention.

FIGS. 4 and 5 are flow diagrams illustrating a bottom-up prefix merging transformation in accordance with one embodiment of the present invention.

FIGS. 6A and 6B illustrate a data structure used in indexing productions and a method of using that data structure.

FIGS. 7A and 7B illustrate a data structure used in indexing productions and a method of using that data structure in accordance with one embodiment of the present invention.

FIGS. 8 and 9 illustrate grammar flattening.

FIGS. 10 and 11 illustrate methods of performing grammar flattening in accordance with embodiments of the present invention.

FIG. 12A is a data structure used in annotating chart edges in accordance with one embodiment of the present invention.

FIG. 12B illustrates a trace-back of chart edges to obtain an analysis of an input text in accordance with one embodiment of the present invention.

FIGS. 13, 14A and 14B illustrate the trace-back of chart edges, using annotations on those edges, in accordance with another embodiment of the present invention.

DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

Overview of Environment

The discussion of FIG. 1 below is simply to set out but one illustrative environment in which the present invention can be used, although it can be used in other environments as well.

FIG. 1 is a block diagram of a computer 20 in accordance with one illustrative embodiment of the present invention. FIG. 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routine programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

In FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 20, including a processing unit 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic routine that helps to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20.

Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media that can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memory (ROM), and the like, may also be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 45 that is coupled to the system bus 23, but may be connected by other interfaces, such as a sound card, a parallel port, a game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers may typically include other peripheral output devices such as a speaker and printers (not shown).

The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer network intranets and the Internet.

When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a network environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Overview of Parsing Notation and Rules

FIG. 2 is a simplified block diagram of a left-corner chart parser. FIG. 2 illustrates that left-corner chart parser 150 receives an input text string and provides at its output an analysis of the input text string. An exemplary input text string, and an exemplary analysis, are discussed below in greater detail. FIG. 2 also illustrates that left-corner chart parser 150 includes a left-corner index table 152, which is used in generating a chart, as is also described in greater detail below.

In the notation that follows, nonterminals, which will sometimes be referred to as categories, will be designated by “low order” upper-case letters (A, B, etc.); and terminals will be designated by lower-case letters. The notation a_(i) indicates the ith terminal symbol in the input string. “High order” upper-case letters (X, Y, Z) denote single symbols that could be either terminals or nonterminals, and Greek letters denote (possibly empty) sequences of terminals and/or nonterminals. For a grammar production A→B₁ . . . B_(n) we will refer to A as the mother of the production and to B₁ . . . B_(n) as the daughters of the production. The nonterminal symbol S is used as the top symbol of the grammar, which subsumes all sentences allowed by the grammar.

The term “item”, as used herein, means an instance of a grammar production with a “dot” somewhere on the right-hand side to indicate how many of the daughters have been recognized in the input, e.g., A→B₁.B₂. An “incomplete item” is an item with at least one daughter to the right of the dot, indicating that at least one more daughter remains to be recognized before the entire production is matched; and a “complete item” is an item with no daughters to the right of the dot, indicating that the entire production has been matched.

The terms “incomplete edge” or “complete edge” mean an incomplete item or complete item, plus two input positions indicating the segment of the input covered by the daughters that have already been recognized. These will be written as (e.g.) <A→B₁B₂.B₃,i,j>, which means that the sequence B₁B₂ has been recognized starting at position i and ending at position j, and has been hypothesized as part of a longer sequence ending in B₃, which is classified as a phrase of category A. The symbol immediately following the dot in an incomplete edge is often of particular interest. These symbols are referred to as “predictions”. Positions in the input will be numbered starting at 0, so the ith terminal of an input string spans position i−1 to i. Items and edges, none of whose daughters have yet been recognized, are referred to as “initial”.

Left-corner (LC) parsing depends on the left-corner relation for the grammar, where X is recursively defined to be a left corner of A if X=A, or the grammar contains a production of the form B→Xα, where B is a left corner of A. This relation is normally precompiled and indexed so that any pair of symbols can be checked in essentially constant time.

A chart-based LC parsing algorithm can be defined by the following set of rules for populating the chart:
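As a minimal sketch of how the left-corner relation might be precompiled, the following Python computes, for each symbol A, the set of all its left corners by following left-most daughters transitively. The function names, the `(mother, daughters)` tuple encoding, and the use of Python sets for the constant-time lookup are illustrative assumptions, not details taken from the specification; the sketch also assumes the grammar has no empty productions.

```python
from collections import defaultdict

# Toy grammar from the background section, as (mother, daughters) pairs.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("Name",)),
    ("Name", ("john",)),
    ("Name", ("mary",)),
    ("VP", ("V", "NP")),
    ("V", ("likes",)),
]


def left_corner_table(productions):
    """Map each symbol A to the set of all X such that X is a left corner of A."""
    direct = defaultdict(set)  # B -> left-most daughters of B's productions
    symbols = set()
    for mother, daughters in productions:
        symbols.add(mother)
        symbols.update(daughters)
        direct[mother].add(daughters[0])  # assumes no empty productions
    table = {}
    for a in symbols:
        seen, stack = {a}, [a]  # base case: X = A
        while stack:
            b = stack.pop()  # b is already known to be a left corner of a
            for x in direct.get(b, ()):  # production b -> x alpha
                if x not in seen:
                    seen.add(x)
                    stack.append(x)
        table[a] = seen
    return table


def is_left_corner(table, x, a):
    """Check the precompiled relation with a single hashed lookup."""
    return x in table.get(a, ())


LC = left_corner_table(GRAMMAR)
```

With this table, `is_left_corner(LC, "john", "S")` holds because john is the left-most daughter of Name, Name of NP, and NP of S, while `is_left_corner(LC, "likes", "S")` does not.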

-   -   1. For every grammar production with S as its mother, S→α, add <S→.α,0,0> to the chart.
    -   2. For every pair of edges of the form <A→α.Xβ,i,k> and <X→γ.,k,j> in the chart, add <A→αX.β,i,j> to the chart.
    -   3. For every edge of the form <A→α.a_(j)β,i,j−1> in the chart, where a_(j) is the jth terminal in the input, add <A→αa_(j).β,i,j> to the chart.
    -   4. For every edge of the form <X→γ.,k,j> in the chart and every grammar production with X as its left-most daughter, of the form B→Xδ, if there is an incomplete edge in the chart ending at k, <A→α.Cβ,i,k>, such that B is a left corner of C, add <B→X.δ,k,j> to the chart.
    -   5. For every input terminal a_(j) and every grammar production with a_(j) as its left-most daughter, of the form B→a_(j)δ, if there is an incomplete edge in the chart ending at j−1, <A→α.Cβ,i,j−1>, such that B is a left corner of C, add <B→a_(j).δ,j−1,j> to the chart.

Note that for Rules 4 and 5 to be executed efficiently, parsing should be performed strictly left-to-right, so that every incomplete edge ending at k has already been computed before any left-corner checks are performed for new edges proposed from complete edges or input terminals starting at k. Apart from this constraint that requires every edge ending at any point k to be generated before any edges ending at points greater than k, individual applications of Rules 1-5 may be intermixed in any order. An input string is successfully parsed as a sentence by this algorithm if the chart contains an edge of the form <S→α.,0,n> when the algorithm terminates.

This formulation of left-corner chart parsing is essentially known. Another prior publication describes a similar algorithm, but formulated in terms of a graph-structured stack of the sort generally associated with another form of parsing called generalized LR parsing, rather than in terms of a chart.

Several additional optimizations can be added to this basic schema. One prior technique adds bottom-up filtering of incomplete edges based on the next terminal in the input. That is, no incomplete edge of the form <A→α.Xβ,i,j> is added to the chart unless a_(j+1) is a left corner of X. Another prior author proposes that, rather than iterate over all the incomplete edges ending at a given input position each time a left-corner check is performed, compute just once for each input position the set of nonterminal predictions of the incomplete edges ending at that position, and iterate over that set for each left-corner check at the position. With this optimization, it is no longer necessary to add initial edges to the chart at position 0 for productions of the form S→α. If P_(i) denotes the set of predictions for position i, we simply let P₀={S}.

Another prior optimization results from the observation that in prior context-free grammar parsing algorithms, the daughters to the left of the dot in an item play no role in the parsing algorithm; thus the representation of items can ignore the daughters to the left of the dot, resulting in fewer distinct edges to be considered. This observation is equally true for left-corner parsing. Thus, instead of A→B₁B₂.B₃, one writes simply A→.B₃. Note that with this optimization, A→. becomes the notation for an item all of whose daughters have been recognized; the only information it contains being just the mother of the production. The present discussion proceeds therefore by writing complete edges simply as <A,i,j>, rather than <A→.,i,j>. One can also unify the treatment of terminal symbols in the input with complete edges in the chart by adding a complete edge <a_(i),i−1,i> to the chart for every input terminal a_(i).

Taking all these optimizations together, we can define a known optimized left-corner parsing algorithm by the following set of parsing rules:

-   -   1. Let P₀={S}.
    -   2. For every input position j>0, let P_(j)={B | there is an incomplete edge in the chart ending at j, of the form <A→.Bα,i,j>}.
    -   3. For every input terminal a_(i), add <a_(i),i−1,i> to the chart.
    -   4. For every pair of edges <A→.XYα,i,k> and <X,k,j> in the chart, if a_(j+1) is a left corner of Y, add <A→.Yα,i,j> to the chart.
    -   5. For every pair of edges <A→.X,i,k> and <X,k,j> in the chart, add <A,i,j> to the chart.
    -   6. For every edge <X,k,j> in the chart and every grammar production with X as its left-most daughter, of the form A→XYα, if there is a B∈P_(k) such that A is a left corner of B, and a_(j+1) is a left corner of Y, add <A→.Yα,k,j> to the chart.
    -   7. For every edge <X,k,j> in the chart and every grammar production with X as its only daughter, of the form A→X, if there is a B∈P_(k) such that A is a left corner of B, add <A,k,j> to the chart.
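Rules 1-7 can be sketched as a small recognizer. The Python below is an illustrative sketch under stated assumptions, not the implementation described in this specification: the names `recognize`, `left_corners` and `extend`, the tuple encoding of edges, and the agenda loop are all hypothetical, and the grammar is assumed to contain no empty productions. For simplicity the prediction sets here also record terminal predictions, which never satisfy the top-down check and are therefore harmless.

```python
from collections import defaultdict


def left_corners(productions):
    """Map each symbol A to the set of its left corners (including A itself)."""
    direct, symbols = defaultdict(set), set()
    for mother, daughters in productions:
        symbols.add(mother)
        symbols.update(daughters)
        direct[mother].add(daughters[0])
    table = {}
    for a in symbols:
        seen, stack = {a}, [a]
        while stack:
            for x in direct.get(stack.pop(), ()):
                if x not in seen:
                    seen.add(x)
                    stack.append(x)
        table[a] = seen
    return table


def recognize(productions, top, tokens):
    """Return True if tokens can be analyzed as an instance of the top symbol."""
    lc = left_corners(productions)
    by_left = defaultdict(list)  # left-most daughter -> [(mother, other daughters)]
    for mother, daughters in productions:
        by_left[daughters[0]].append((mother, daughters[1:]))
    n = len(tokens)
    # incomplete[j][Y] holds (A, rest, i) for each edge <A -> .Y rest, i, j>
    incomplete = [defaultdict(set) for _ in range(n + 1)]
    predictions = [set() for _ in range(n + 1)]
    predictions[0].add(top)  # Rule 1
    complete = set()  # complete edges (X, k, j)

    def bottom_up_ok(y, j):
        # a_(j+1) is tokens[j]; it must be a left corner of the prediction Y
        return j < n and tokens[j] in lc[y]

    def extend(a, rest, i, j, agenda):
        if not rest:
            agenda.append((a, i, j))  # complete edge <A, i, j>
        else:
            incomplete[j][rest[0]].add((a, rest[1:], i))
            predictions[j].add(rest[0])  # Rule 2

    for j in range(1, n + 1):  # strictly left to right
        agenda = [(tokens[j - 1], j - 1, j)]  # Rule 3
        while agenda:
            edge = agenda.pop()
            if edge in complete:
                continue  # dynamic programming: keep one copy of each edge
            complete.add(edge)
            x, k, _ = edge
            for a, rest, i in incomplete[k].get(x, ()):  # Rules 4 and 5
                if not rest or bottom_up_ok(rest[0], j):
                    extend(a, rest, i, j, agenda)
            for a, rest in by_left.get(x, ()):  # Rules 6 and 7
                if rest and not bottom_up_ok(rest[0], j):
                    continue
                if any(a in lc.get(b, ()) for b in predictions[k]):
                    extend(a, rest, k, j, agenda)
    return (top, 0, n) in complete


GRAMMAR = [
    ("S", ("NP", "VP")), ("NP", ("Name",)), ("Name", ("john",)),
    ("Name", ("mary",)), ("VP", ("V", "NP")), ("V", ("likes",)),
]
```

For the toy grammar of the background section, `recognize(GRAMMAR, "S", ["john", "likes", "mary"])` succeeds, whereas the truncated input `["john", "likes"]` does not, because the bottom-up check on the prediction NP has no next terminal to match.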

Order of Filtering Checks

Note that in Rule 6, the top-down left-corner check on the mother of the proposed incomplete edge and the bottom-up left-corner check on the prediction of the proposed incomplete edge are independent of each other, and therefore could be performed in either order. For each proposed edge, the top-down check determines whether the mother A of the grammar production is a left-corner of any prediction at input position k, in order to determine whether the production is consistent with what has already been recognized. This requires examining an entry in a left-corner table for each of the elements of the prediction list (i.e., the predictions in the incomplete edges), until a check succeeds or the list is exhausted. The bottom-up check determines whether the terminal in the j+1st position (a_(j+1)) of the input is a left-corner of Y. This requires examining only one entry in the left-corner table.

Therefore, in accordance with one embodiment of the present invention, the bottom-up check is performed before the top-down check, since the top-down check need not be performed if the bottom-up check fails. It has been found experimentally that performing the filtering steps in this order is always faster, by as much as 31%.
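The cost asymmetry between the two checks can be illustrated with a short Python sketch that counts left-corner table lookups. The function `passes_filters`, the `stats` counter, and the hand-written table `LC` (mapping each category to the set of its left corners for the toy grammar) are hypothetical names introduced only for this illustration.

```python
# Left corners of each category in the toy grammar (written out by hand).
LC = {
    "S": {"S", "NP", "Name", "john", "mary"},
    "NP": {"NP", "Name", "john", "mary"},
    "VP": {"VP", "V", "likes"},
}


def passes_filters(mother, prediction, next_terminal, predictions_at_k, lc, stats):
    """Apply the Rule 6 filters to a proposed edge, cheapest check first."""
    # Bottom-up check: exactly one table lookup.
    stats["lookups"] += 1
    if next_terminal not in lc[prediction]:
        return False  # the top-down check is skipped entirely
    # Top-down check: up to one lookup per prediction at position k.
    for b in predictions_at_k:
        stats["lookups"] += 1
        if mother in lc[b]:
            return True
    return False


stats_fail = {"lookups": 0}
# Proposed edge <S -> .VP, 0, 1>, but the next terminal cannot start a VP.
rejected = passes_filters("S", "VP", "mary", ["NP", "VP", "S"], LC, stats_fail)

stats_ok = {"lookups": 0}
# Same proposal with "likes" as the next terminal.
accepted = passes_filters("S", "VP", "likes", ["NP", "VP", "S"], LC, stats_ok)
```

In the rejected case only one lookup is made, regardless of how many predictions are pending at position k; in the accepted case the top-down loop adds one lookup per prediction examined until a match is found.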

FIGS. 3A-3C are flow diagrams that illustrate the performance of the filtering (or checking) steps in greater detail in accordance with one embodiment of the present invention. FIG. 3A illustrates that, for every edge of the form <X,k,j> in the chart being constructed, and for every grammar production with X as its left-most daughter, of the form A→XYα, the bottom-up left-corner filtering step is performed on the prediction Y of the proposed incomplete edge <A→.Yα,k,j>. This is indicated by blocks 154, 156 and 158 in FIG. 3A. Next, it is determined whether the bottom-up left-corner check has been satisfied. This is indicated by block 160. If the check has not been satisfied, then the proposed incomplete edge is not added to the chart and the filtering step is completed. However, if the bottom-up left-corner check has been satisfied, then the top-down left-corner check is performed on the mother A of the proposed incomplete edge <A→.Yα,k,j>. This is indicated by block 162.

It is next determined whether the top-down left-corner check has been satisfied. If not, again the proposed incomplete edge is not added to the chart and the filtering procedure is complete. If so, however, then the proposed incomplete edge <A→.Yα,k,j> is added to the chart. This is indicated by blocks 164 and 166 in FIG. 3A.

FIG. 3B is a more detailed flow diagram illustrating the performance of the bottom-up left-corner test on the prediction Y of the proposed incomplete edge. First, the next terminal in the input text is examined by parser 150. This is indicated by block 168 in FIG. 3B. The left-corner table is then accessed. The left-corner table, in one embodiment, can be thought of as a set of pairs of the form (X,Y), meaning that X is a left corner of Y. The left-corner table can be implemented, in one embodiment, in the form of nested hash tables. It is determined whether the left-corner table contains an entry for the pair consisting of the next input terminal and the prediction Y. If not, then the prediction Y cannot be correct and thus the proposed incomplete edge under consideration cannot be correct, so it is not added to the chart. This is indicated by blocks 170 and 171 in FIG. 3B.

However, if the next input terminal and the prediction Y do satisfy the left-corner check, then the bottom-up left-corner test is satisfied and the top-down left-corner check can be performed. This is indicated by block 172 in FIG. 3B.

FIG. 3C illustrates the top-down left-corner check on the mother A of the proposed edge in greater detail. The top-down check is basically checking to see whether the mother of the proposed incomplete edge is consistent with edges previously found in the input text. Therefore, a prediction from the incomplete edges ending at the corresponding input position is selected from the chart. Next, the left-corner table is examined to see whether the mother A is a left corner of that prediction. This is indicated by blocks 174 and 176 in FIG. 3C. If not, then the production with A as its mother is inconsistent with the incomplete edges containing the selected prediction. This is repeated until a match is found or no predictions are left to be tested. At that point, if no match has been found, the top-down left-corner check is not satisfied. This is indicated by blocks 177 and 178, and the production is not added to the chart.

However, if the mother A is a left-corner of a prediction of an incomplete edge already in the chart ending at the corresponding input position, then the top-down left-corner test is satisfied, meaning that the production with A as its mother is, to this point, still consistent with edges that have already been found in the input text. This is indicated by block 180 in FIG. 3C.

Bottom-Up Prefix Merging

In left-to-right parsing, if two grammar productions share a common left prefix, e.g., A→BC and A→BD, many current parsing algorithms duplicate work for the two productions until reaching the point where they differ. A simple solution often proposed to address this problem is to “left factor” the grammar. Left factoring applies the following grammar transformation repeatedly, until it is no longer applicable.

For each nonterminal A, let α be the longest nonempty sequence such that there is more than one grammar production of the form A→αβ. Replace the set of productions A→αβ₁, . . . , A→αβ_(n) with A→αA′, A′→β₁, . . . , A′→β_(n), where A′ is a new nonterminal symbol.

Left factoring applies only to sets of productions with a common mother category, but as an essentially bottom-up method, LC parsing does most of its work before the mother of a production is determined. Another grammar transformation was introduced in prior parsing techniques, as follows:
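The left-factoring transformation just described can be sketched in Python as follows. This is an illustrative sketch only: the function names, the `(mother, daughters)` tuple encoding, and the scheme of naming the new nonterminal by appending a prime are assumptions. Note that when one production's right-hand side is itself the shared prefix, β is empty and the sketch emits an empty production for the new nonterminal.

```python
from collections import Counter


def fresh(base, used):
    """Return an unused nonterminal name derived from base (A', A'', ...)."""
    name = base + "'"
    while name in used:
        name += "'"
    used.add(name)
    return name


def left_factor(productions):
    """Repeatedly factor out the longest prefix shared by productions of one mother."""
    prods = set(productions)
    used = {m for m, _ in prods} | {s for _, d in prods for s in d}
    changed = True
    while changed:
        changed = False
        for a in sorted({m for m, _ in prods}):
            # Count every nonempty prefix of every right-hand side of A.
            counts = Counter(
                d[:n] for m, d in prods if m == a for n in range(1, len(d) + 1)
            )
            shared = [p for p, c in counts.items() if c > 1]
            if not shared:
                continue
            alpha = max(shared, key=lambda p: (len(p), p))  # longest shared prefix
            new = fresh(a, used)
            group = {(m, d) for m, d in prods if m == a and d[:len(alpha)] == alpha}
            prods -= group
            prods.add((a, alpha + (new,)))  # A -> alpha A'
            prods |= {(new, d[len(alpha):]) for _, d in group}  # A' -> beta_i
            changed = True
    return prods


factored = left_factor([("A", ("B", "C")), ("A", ("B", "D"))])
```

For the example productions A→BC and A→BD from the text, the result is A→BA′, A′→C and A′→D.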

-   -   Let α be the longest sequence of at least two symbols such that there is more than one grammar production of the form A→αβ. Replace the set of productions A₁→αβ₁, . . . , A_(n)→αβ_(n) with A′→α, A₁→A′β₁, . . . , A_(n)→A′β_(n) where A′ is a new nonterminal symbol.

Like left factoring, this transformation is repeated until it is no longer applicable. While this transformation has been applied to left-corner stack based parsing, it has never been applied to left-corner chart parsing. In that context, and in accordance with one embodiment of the present invention, it is referred to herein as “bottom-up prefix merging”.
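A minimal Python sketch of bottom-up prefix merging follows. Unlike left factoring, the shared prefix is sought across all productions regardless of mother, and it must contain at least two symbols. The function name `prefix_merge` and the generated nonterminal names `M1`, `M2`, and so on are hypothetical choices made for this illustration.

```python
from collections import Counter
from itertools import count


def prefix_merge(productions):
    """Repeatedly merge the longest prefix of >= 2 symbols shared by any productions."""
    prods = set(productions)
    used = {m for m, _ in prods} | {s for _, d in prods for s in d}
    ids = count(1)
    while True:
        # Count every prefix of length >= 2, ignoring the mother category.
        counts = Counter(d[:n] for _, d in prods for n in range(2, len(d) + 1))
        shared = [p for p, c in counts.items() if c > 1]
        if not shared:
            return prods  # transformation no longer applicable
        alpha = max(shared, key=lambda p: (len(p), p))  # longest shared prefix
        new = next(f"M{i}" for i in ids if f"M{i}" not in used)
        used.add(new)
        group = {(m, d) for m, d in prods if d[:len(alpha)] == alpha}
        prods -= group
        prods.add((new, alpha))  # A' -> alpha
        prods |= {(m, (new,) + d[len(alpha):]) for m, d in group}  # A_i -> A' beta_i


merged = prefix_merge([("A", ("B", "C", "D")), ("E", ("B", "C", "F"))])
```

Here A→BCD and E→BCF have different mothers, so left factoring would leave them untouched, while prefix merging yields M1→BC, A→M1 D and E→M1 F, so that the shared sequence BC is recognized only once.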

FIGS. 4 and 5 are flow diagrams illustrating the application of bottom-up prefix merging in accordance with one embodiment of the present invention. First, the productions in the grammar are examined to find multiple productions that share the longest sequence of at least two symbols in the left-most position on the right hand side of the different productions. This is indicated by block 300 in FIG. 4. Then, the bottom-up prefix merging transformation is applied to those productions, regardless of whether the mother of the productions is the same. This is indicated by block 302. The transformed grammar productions are then output as the new grammar. This is indicated by block 304.

FIG. 5 is a flow diagram illustrating the application of the bottom-up prefix merging transformation in more detail. First, the set of productions in the grammar that have the form illustrated in block 306 are retrieved. The retrieved productions are transformed into productions of another form illustrated in block 308 of FIG. 5. The steps of retrieving the set of productions and transforming those productions are iterated until the transform is no longer applicable. This is indicated by block 310 in FIG. 5.

It can thus be seen that this transformation examines the prefix of the right hand side of the productions to eliminate duplication of work for two productions that have a common prefix on their right hand sides, regardless of the mother of the production.
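
The transformation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any embodiment: the representation of a production as a `(mother, daughters)` pair, the function name `bottom_up_prefix_merge`, and the generated nonterminal names `_M1`, `_M2`, . . . are assumptions introduced here, and the generated names are assumed not to collide with symbols already in the grammar.

```python
from collections import defaultdict


def bottom_up_prefix_merge(productions):
    """Apply bottom-up prefix merging until it is no longer applicable.

    productions: iterable of (mother, daughters), daughters a tuple of str.
    Returns a set of productions in the same form.
    """
    grammar = set(productions)
    fresh = 0  # counter for hypothetical new nonterminal names
    while True:
        # Group productions by every right-hand-side prefix of length >= 2,
        # ignoring the mother category.
        by_prefix = defaultdict(set)
        for mother, rhs in grammar:
            for n in range(2, len(rhs) + 1):
                by_prefix[rhs[:n]].add((mother, rhs))
        shared = [p for p, prods in by_prefix.items() if len(prods) > 1]
        if not shared:
            return grammar
        # Pick the longest shared prefix (ties broken arbitrarily but
        # deterministically).
        alpha = max(shared, key=lambda p: (len(p), p))
        fresh += 1
        new_nt = f"_M{fresh}"
        # Replace each A_i -> alpha beta_i with A_i -> A' beta_i ...
        for mother, rhs in by_prefix[alpha]:
            grammar.remove((mother, rhs))
            grammar.add((mother, (new_nt,) + rhs[len(alpha):]))
        # ... and add the single shared production A' -> alpha.
        grammar.add((new_nt, alpha))
```

For example, merging A→BCD and E→BCF, which have different mothers, yields `_M1`→BC, A→`_M1`D, and E→`_M1`F, so the prefix BC is recognized only once.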

It has been found experimentally that left factoring generally makes left-corner chart parsing slower rather than faster. Bottom-up prefix merging, on the other hand, speeds up left-corner chart parsing by as much as 70%.

Indexing Productions by Next Input Symbol

In general, it is most efficient to store the grammar productions for parsing in a data structure that partially combines productions that share elements in common, in the order that those elements are examined by the parsing algorithm. Therefore, the grammar productions for the present left-corner chart parser are stored as a discrimination tree, implemented as a set of nested hash tables. In addition, productions with only one daughter are stored separately from those with more than one daughter. One way to define a data structure for the latter is illustrated in FIG. 6A.

FIG. 6A shows that a first data portion in the data structure 200 is an index that contains pointers to data structures for productions indexed by their left-most daughter 202. This is because left-corner parsing proposes a grammar production when its left-most daughter has been found, so productions are indexed first by that. Data structure 200 also includes copies of a data structure 204, which indexes pointers to data structures for productions by a next daughter so that the input symbol can be checked against the next daughter to see whether the next daughter has the input symbol as a left corner. This is because when a production is proposed, the next daughter is checked to see whether it has the next input symbol as a left corner. This requires each entry in index 204 to be checked against the next input symbol.

Data structure 200 also includes copies of a data structure 206, which indexes pointers to data structures for productions by the mother of the productions. This is so that a top-down check can be performed to see whether the mother is a left corner of some previous prediction. This ensures that the mother of the production is consistent with what has been found in the chart so far. Finally, the remaining portions of the productions are enumerated. This is indicated by data portion 208 in data structure 200.

FIG. 6B illustrates the direction of tracing through the data structure 200 in performing the various checks just described. FIG. 6B further illustrates that each data structure holds a set of pointers to data structures for productions based upon the index criteria. For example, data portion 202 holds pointers to data structures for productions based on the left corner of those productions. Therefore, as the input text is being analyzed, data portion 202 is accessed and the partial analysis of the input text is compared against the values in data portion 202. When a match is found, the pointer associated with that match is provided such that productions are identified that satisfy the left corner criteria indexed in data portion 202.

The pointer, in one embodiment, points to a copy of data portion 204 that indexes the productions by the possible next daughters for productions having the left corner matched in data portion 202. When a match is found in performing the left-corner check against the next input symbol, a pointer is obtained which points to a copy of data portion 206 that indexes productions with the given left corner and next daughter by their mother such that a determination can be made as to whether the currently hypothesized productions are consistent with what has been previously identified (i.e., whether the mother of the production is the left corner of some previous prediction). Finally, the remainders of the productions with a given left corner, next daughter, and mother are retrieved from the values in a copy of data portion 208.

A way to store the productions that results in faster parsing, in accordance with one embodiment of the present invention, is to precompute which productions are consistent with which input symbols, by defining a structure that for each possible input symbol contains a discrimination tree just for the productions whose second daughters have that input symbol as a left corner. This entire structure is therefore set out in the order shown for structure 212 in FIG. 7A.

As the parser works from left to right, at each point in the input, it looks up the sub-structure for the productions consistent with the next symbol in the input. It processes them as before, except that the check that the second daughter has the next input symbol as a left corner is omitted, since that check was precomputed.

FIGS. 7A and 7B illustrate data structure 212 used in accordance with one embodiment of the present invention. Data portions which are the same as those found in FIGS. 6A and 6B are correspondingly numbered. However, rather than beginning by indexing the productions according to the left corner (or left-most daughter), data structure 212 begins by indexing productions whose second daughters have, as a left corner, the next input symbol. This is indicated by data portion 214. In one embodiment, data portion 214 holds pointers to data structures for productions that have the next input symbol as a left corner of their second daughter. These pointers, in one embodiment, point to copies of data portion 202 that point to copies of data portions 206, and so on. The analysis then continues as discussed with respect to FIG. 6B, through the data portions 206 and 209. It will be noted that data portion 209 now also contains the second daughters that were separated out in the original method of indexing described with respect to FIGS. 6A and 6B.

This way of indexing the productions can tend to increase storage requirements. However, since the entire structure is indexed first by input symbol, it is only necessary to load that part of the structure indexed by symbols that actually occur in the text being parsed. The part of the structure for the most common words of the language is illustratively pre-loaded; and since words seen once in a given text tend to be repeated, all of the structure that is loaded is illustratively retained until processing is complete or until the parser switches to an unrelated text.
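
The indexing order described above (input symbol, then left-most daughter, then mother, then the remainder of the production) can be illustrated with nested Python dictionaries standing in for the nested hash tables. This is a minimal sketch under stated assumptions: the function names, the `(mother, daughters)` representation, and the treatment of a terminal as its own left corner are choices made here for illustration, and the grammar is assumed to contain no empty productions.

```python
from collections import defaultdict


def left_corner_terminals(productions, terminals):
    """Map each symbol to the set of terminals that are left corners of it.

    A terminal is treated as a left corner of itself (an assumption of
    this sketch).
    """
    lc = defaultdict(set)
    for t in terminals:
        lc[t].add(t)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for mother, rhs in productions:
            new = lc[rhs[0]] - lc[mother]
            if new:
                lc[mother] |= new
                changed = True
    return lc


def build_index(productions, terminals):
    """Build index[input_symbol][leftmost_daughter][mother] -> remainders.

    Each remainder is the production's daughters from the second daughter
    on. Productions with a single daughter are skipped, since the text
    stores them separately.
    """
    lc = left_corner_terminals(productions, terminals)
    index = {}
    for mother, rhs in productions:
        if len(rhs) < 2:
            continue
        # Precompute: file the production under every input symbol that is
        # a left corner of its second daughter.
        for symbol in lc[rhs[1]]:
            (index.setdefault(symbol, {})
                  .setdefault(rhs[0], {})
                  .setdefault(mother, [])
                  .append(rhs[1:]))
    return index
```

With such an index, a parser that has just completed a left-most daughter `d` and sees next input symbol `w` would look up `index.get(w, {}).get(d, {})` and obtain only productions whose second daughter can begin with `w`, so no per-production left-corner check is needed at parse time.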

Grammar Flattening

One possible way of reducing the amount of work a parser has to do is to remove levels of structure from the grammar. For example, instead of the productions:

NP→Name

Name→john

Name→mary

one could omit the category Name altogether, and simply use the productions:

NP→john

NP→mary

Techniques for removing levels of structure from the grammar can be referred to by the general term “grammar flattening”.

FIGS. 8 and 9 are graphs which further illustrate the concept of grammar flattening for the phrase “a young boy”. In FIG. 8, the head node of the graph is a noun phrase and it extends four levels deep, ending with the words in the phrase. In FIG. 9, the grammar has been flattened such that it extends only three levels deep. In FIG. 9, the graph has a noun phrase head node and three descendent nodes (a determiner, an adjective, and a noun). The actual words in the phrase “a young boy” descend from these three descendent nodes.

In general, grammars can be flattened by taking a production, and substituting the sequence of daughters in the production for occurrences of the mother of the production in other productions. This does not always result in faster parsing.

However, in accordance with the embodiments of the present invention, a number of specific ways of grammar flattening have been developed that are effective in speeding up left-corner chart parsing. The first method is referred to as “elimination of single-option chain rules”. If there exists a nonterminal symbol A that appears on the left-hand side of a single production A→X, where X is a single terminal or nonterminal symbol, A→X is referred to as a “single-option chain rule”. Single-option chain rules can be eliminated from a context-free grammar without changing the language allowed by the grammar, simply by omitting the production, and substituting the single daughter of the production for the mother of the production everywhere else in the grammar.
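
A minimal Python sketch of elimination of single-option chain rules follows. The function name, the `(mother, daughters)` representation, and the `start` parameter are assumptions of this sketch; in particular, the start symbol is left untouched here so that the top category of the grammar keeps its name, a precaution the text does not spell out.

```python
def eliminate_single_option_chain_rules(productions, start="S"):
    """Remove every single-option chain rule A -> X (A != start).

    productions: iterable of (mother, daughters), daughters a tuple of str.
    Returns a list of productions with duplicates removed.
    """
    grammar = list(productions)
    while True:
        # Collect the alternatives for each mother category.
        by_mother = {}
        for mother, rhs in grammar:
            by_mother.setdefault(mother, []).append(rhs)
        # Find a nonterminal with exactly one production, whose right-hand
        # side is a single symbol other than itself.
        chain = next(
            ((m, alts[0][0]) for m, alts in by_mother.items()
             if m != start and len(alts) == 1 and len(alts[0]) == 1
             and alts[0][0] != m),
            None,
        )
        if chain is None:
            return grammar
        a, x = chain
        # Omit A -> X and substitute X for A everywhere else.
        rewritten = [
            (mother, tuple(x if s == a else s for s in rhs))
            for mother, rhs in grammar
            if mother != a
        ]
        grammar = list(dict.fromkeys(rewritten))  # drop duplicates, keep order
```

Each pass removes one nonterminal from the grammar entirely, so the loop terminates, and the number of productions never grows.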

Elimination of single-option chain rules is perhaps the only method of grammar flattening that is guaranteed not to increase the size or complexity of the grammar. Grammar flattening involving nonterminals defined by multiple productions can result in a combinatorial increase in the size of the grammar. However, in accordance with one embodiment of the present invention, it has been found that if flattening is confined to the leftmost daughters of productions, increased parsing speeds can be achieved without undue increases in grammar size. These techniques are referred to herein as “left-corner grammar flattening”. Two techniques of left-corner grammar flattening that generally speed up left-corner chart parsing are as follows:

Technique 1: For each nonterminal A, such that

-   -   A is not a left-recursive category and
    -   A does not occur as a daughter of a rule except as the left-most daughter,

do the following:

    -   For each production of the form A→X₁ . . . Xₙ and each production of the form B→Aα, add B→X₁ . . . Xₙα to the grammar.
    -   Remove all productions containing A from the grammar.

Technique 2: For each nonterminal A, such that

-   -   A is not a left-recursive category,
    -   A does not occur as a daughter of a rule except as the left-most daughter, and
    -   there is some production that has A as the mother and at least one nonterminal as a daughter,

do the following:

    -   For each production of the form A→X₁ . . . Xₙ and each production of the form B→Aα, add B→X₁ . . . Xₙα to the grammar.
    -   Remove all productions containing A from the grammar.

FIGS. 10 and 11 are flow diagrams illustrating techniques 1 and 2 discussed above, in greater detail. Techniques 1 and 2 restrict the implementation of the grammar flattening to only non-left-recursive categories and only if those categories only appear in a left corner position. Further, according to technique 2, the flattening operation is only performed if the category has at least one daughter that is also a category. This additional restriction makes parsing slightly slower, but results in a much more compact grammar.

Therefore, technique 1 discussed above first determines whether the category is a non-left-recursive category. This is indicated by block 340 in FIG. 10. If not, the grammar flattening operation is not performed. If so, then it is determined whether the category only appears as a daughter of a production if it is the left corner of that production. This is indicated by block 342. If not, again the flattening operation is not performed.

If so, however, then the grammar is first flattened by adding productions, as identified in block 344, and then removing all productions containing the identified category from the grammar. This is indicated in block 346.

Technique 2, illustrated in FIG. 11, has a number of steps which are similar to those found in technique 1, illustrated in FIG. 10. Those steps are similarly numbered. Therefore, technique 2 first determines whether the category A is non-left-recursive and whether A only appears as a daughter of a production if it is the left corner of the production. This is indicated by blocks 340 and 342. However, FIG. 11 illustrates that, prior to performing the grammar flattening, it is determined whether there is a production that has the category A as its mother and at least one non-terminal as a daughter. This is indicated by block 348. If not, then the grammar flattening step would only minimally speed up parsing, at the expense of significantly increasing the grammar size, so the grammar flattening step is not performed. If so, however, then the two steps illustrated by blocks 344 and 346, in which productions are added to the grammar and all productions containing the category A are removed from the grammar (as discussed with respect to FIG. 10), are performed.

It should be noted that a nonterminal is left-recursive if it is a proper left corner of itself, where X is recursively defined to be a proper left corner of A if the grammar contains a production of the form A→Xα or a production of the form B→Xα, where B is a proper left corner of A. This and the elimination of left recursion are discussed in greater detail in the above-referenced co-pending patent application.
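
The left-recursion test and the two left-corner flattening techniques can be sketched together in Python. Everything named below is hypothetical and introduced only for illustration: the function names, the `(mother, daughters)` representation, the `require_nonterminal_daughter` flag that selects technique 2, and the `start` parameter. The sketch assumes no empty productions, and it exempts the start symbol, since a category that never occurs as a daughter would otherwise satisfy the stated conditions vacuously and be removed.

```python
def proper_left_corners(productions):
    """Map each mother A to the set of proper left corners of A."""
    plc = {mother: set() for mother, _ in productions}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for mother, rhs in productions:
            # The left-most daughter, and all of its proper left corners,
            # are proper left corners of the mother.
            new = {rhs[0]} | plc.get(rhs[0], set())
            if not new <= plc[mother]:
                plc[mother] |= new
                changed = True
    return plc


def flatten_left_corners(productions, start="S",
                         require_nonterminal_daughter=False):
    """Technique 1 by default; technique 2 if the flag is set."""
    grammar = list(productions)
    for a in sorted({m for m, _ in grammar} - {start}):
        mothers = {m for m, _ in grammar}
        expansions = [rhs for m, rhs in grammar if m == a]
        if a in proper_left_corners(grammar).get(a, set()):
            continue  # A is left-recursive
        if any(a in rhs[1:] for _, rhs in grammar):
            continue  # A occurs as a non-left-most daughter
        if require_nonterminal_daughter and not any(
            s in mothers for rhs in expansions for s in rhs
        ):
            continue  # technique 2: A has only terminal daughters
        flattened = []
        for m, rhs in grammar:
            if m == a:
                continue  # remove productions with A as mother
            if rhs[0] == a:
                # B -> A alpha becomes B -> X1 ... Xn alpha for each A -> X1 ... Xn
                flattened.extend((m, x + rhs[1:]) for x in expansions)
            else:
                flattened.append((m, rhs))
        grammar = flattened
    return grammar
```

On the small grammar S→NP VP, NP→Name, Name→john, Name→mary, VP→sleeps, technique 1 removes both NP and Name, while technique 2 removes only NP, because every production for Name has only a terminal daughter.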

Annotating Chart Edges for Extraction of Parses

The previously mentioned prior art technique of omitting recognized daughters from items leads to issues regarding how parses are to be extracted from the chart. The daughters to the left of the dot in an item are often used for this purpose in item-based methods. However, other methods suggest storing with each non-initial edge in the chart a list that includes, for each derivation of the edge, a pair of pointers to the preceding edges (complete and incomplete edges) that caused it to be derived. This provides sufficient information to extract the parses without additional searching, even without the daughters to the left of the dot.

One embodiment of the present invention yields further benefits. For each derivation of a non-initial edge, it is sufficient to attach to the edge, by way of annotation, only the mother category and the starting position of the complete edge that was used in the last step of the derivation. It should also be noted that in left-corner parsing, only non-initial edges are ever added to the chart; however, this technique for annotating chart edges and extracting parses also works for other parsing methods that do create initial edges in the chart.

FIG. 12A illustrates a data structure 350 which is attached to (or pointed to by) an edge in a chart being developed. Data structure 350 simply includes two portions. The first portion 352 contains the category of the mother of the complete edge used in the last step of deriving the non-initial edge. The second data portion 354 simply contains the starting position in the input text of the complete edge, the mother of which is identified in portion 352. By storing one of these structures for each derivation of an edge, the edges can be traced back to obtain a full analysis of the input text.

Every non-initial edge is derived by combining a complete edge with an incomplete edge. Suppose <A→.β,k,j> is a derived edge, and it is known that the complete edge used to derive this edge had category X and start position i. It is then known that the complete edge must have been <X,i,j>, since the complete edge and the derived edge must have the same end position. It is further known that the incomplete edge used in the derivation must have been <A→.Xβ,k,i>, since that is the only incomplete edge that could have combined with the complete edge to produce the derived edge. Any complete edge can thus be traced back to find the complete edges for all the daughters that derived it. The trace terminates when an incomplete edge is reached that has the same start point as the complete edge it was derived from. These “local” derivations can be pieced together to obtain a full analysis of the input text.

For example, suppose that one has derived a complete edge <S,0,9> as illustrated in FIG. 12B, which we can also show as 358 (written in expanded notation). It can be seen that if the data structure 360 (representing the last complete edge used in deriving edge 358) is attached to 358, where 7 is the beginning or initial position of a complete edge of category C, then one knows that 358 must have been derived by combining the complete edge <C,7,9>, 361, and the incomplete edge <S→.C,0,7>, 362. If the incomplete edge 362 occurs in the chart with the data structure 364 attached, one can see that 362 must have been derived from the complete edge <B,5,7>, 365, and the incomplete edge <S→.BC,0,5>, 366. Then if the data structure 368 is attached to 366, one can see that 366 must have been derived from the complete edge <A,0,5>, 369, and the production S→ABC, 371. One can tell that this was a production rather than another non-initial incomplete edge, because 368 and 366 have the same start point. Thus we know that the original complete edge <S,0,9> was derived from the sequence of complete edges <A,0,5>, <B,5,7>, and <C,7,9>. Since the categories of these complete edges may not be terminals, the trace-back process may need to be repeated for one or more of these complete edges as well. Using the derivation data structures attached to the chart records for these edges, we can recursively extract the complete analysis of the entire sentence, down to the level of words.
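
The trace-back just described can be sketched in Python. The representation is an assumption of this sketch rather than anything prescribed by the text: an edge <A→.β,k,j> is the tuple `(A, β, k, j)` with β a tuple of remaining daughters (empty for a complete edge), and the chart is a dictionary mapping each edge to a list of `(category, start)` annotations, one per derivation. The function names are hypothetical, only the first derivation of each edge is followed, and a complete edge with no chart entry is treated as a word-level leaf.

```python
def daughters(chart, edge):
    """Return the complete edges (X, i, j) that derived `edge`, left to right."""
    mother, remaining, start, end = edge
    found = []
    while True:
        # Annotation: category and start position of the complete edge used
        # in the last step of deriving the current edge.
        category, i = chart[(mother, remaining, start, end)][0]
        # The complete edge shares its end position with the derived edge.
        found.append((category, i, end))
        if i == start:
            break  # the other parent was a production, not a chart edge
        # The incomplete edge used must have been <mother -> .X remaining, start, i>.
        remaining = (category,) + remaining
        end = i
    return found[::-1]


def extract(chart, category, start, end):
    """Recursively rebuild one analysis of the complete edge <category,start,end>."""
    edge = (category, (), start, end)
    if edge not in chart:
        return (category, start, end)  # leaf: no recorded derivation
    children = [extract(chart, *d) for d in daughters(chart, edge)]
    return (category, start, end, children)
```

With the three annotations of the example above stored as `{("S", (), 0, 9): [("C", 7)], ("S", ("C",), 0, 7): [("B", 5)], ("S", ("B", "C"), 0, 5): [("A", 0)]}`, calling `daughters` on the edge `("S", (), 0, 9)` recovers the complete edges <A,0,5>, <B,5,7>, and <C,7,9> without storing any pointer to an incomplete edge.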

FIG. 13 is a flow diagram illustrating how the information for the complete edges is stored. When a non-initial edge E is derived and added to the chart (as indicated by block 370), the mother category and the starting position of the complete edge that was used to derive the non-initial edge E are stored in the form of the data structure 350 illustrated in FIG. 12A. This is indicated by block 372. Finally, a pointer from the derived edge E to the mother and starting position stored at block 372 is also stored. This is indicated by block 374. It can thus be seen that data structure 350 is quite abbreviated, and no pointer to an incomplete edge is even needed.

FIGS. 14A and 14B are flow diagrams which better illustrate the trace-back process. First, in general, parsing proceeds left to right until there are no more words in the input sentence. Then it can be determined whether there is a complete parse of the input by examining the chart to see if there is a complete edge of category S spanning the entire input, from 0 to n, if there are n words in the input sentence. If the application needs to retrieve the analyses of the sentence at this point, then it initiates the trace-back process, beginning with the complete edge <S,0,n>. Initiation of the trace-back process is indicated by block 376. The pointer to the derivation data structure associated with the derived edge currently under consideration is examined as indicated by block 378. The edge category and its starting position for some derivation of the edge, which are pointed to at block 378, are then retrieved. This is indicated by block 380. It should be noted that an edge may have several derivations, with a category/starting position pair stored for each derivation. If one chooses only one pair for each edge, a single analysis for the sentence is obtained. To obtain all analyses, one must iterate through all derivations. The ending position of the complete edge is then determined based on the ending position of the derived edge. This is indicated by block 382. The incomplete edge used in the most recent derivation is computed. This is indicated by block 384. The computed incomplete edge is then located in the chart, and it is determined whether more complete edges need to be retrieved. This is indicated by blocks 386 and 388. If so, the pointers associated with the most recently computed incomplete edge are examined for the location of the next edge category and starting position which needs to be retrieved. This is indicated by block 390. Processing then reverts to block 380 wherein the complete edge category and its starting position are retrieved.

After all of the complete edges that compose the original derived edge have been retrieved, the ones for nonterminal categories are traced back recursively and the results are assembled into a complete analysis of the edge originally being traced back. This is indicated by block 392.

FIG. 14B is a more detailed flow diagram illustrating how the decision in block 388 is made (and consequently how the trace-back terminates). It is determined whether the starting position of the most recently computed incomplete edge is the same as that of the most recently retrieved complete edge from which it was derived. This is indicated by block 394 in FIG. 14B. If the starting positions are not the same, then additional edges need to be retrieved in order to obtain the full analysis of the input text segment. This is indicated by block 396. If the starting positions are the same, then the most recent computation has yielded a production rather than an incomplete edge and no more edges need to be retrieved at this level of processing.

It can thus be seen that the present invention provides a number of techniques and embodiments for improving the speed and efficiency of parsing, and in some cases, specifically left-corner chart parsing. These improvements have been seen to increase the speed of the left-corner chart-parsing algorithm by as much as 40 percent over the best prior art methods currently known. These techniques can be used alone or in any combination of ways to obtain advantages and benefits over prior left-corner chart parsers.

Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

1. A method of indexing productions for use in a left-corner chart parser which parses input text containing input symbols, the method comprising: indexing the productions first based on input symbols which are consistent with the productions; and generating a data structure comprising one or more discrimination trees corresponding to productions with one or more second daughters that have a left corner comprising one of the input symbols.
2. The method of claim 1 wherein indexing comprises: precomputing which of the productions are consistent with which of the input symbols.
3. The method of claim 2 wherein precomputing comprises: precomputing, for each possible input symbol, which productions have a second daughter with that input symbol as a left corner.
4. The method of claim 1 and further comprising: indexing the productions next based on a left-most daughter of the productions.
5. The method of claim 4 and further comprising: indexing the productions next based on a mother of the productions.
6. The method of claim 5 and further comprising: enumerating the productions based on remainder of the productions, other than the left-most daughter and the mother.
7. A method of parsing input text using a left-corner chart parsing process, comprising: receiving an input symbol in the input text; accessing an input symbol index comprising one or more discrimination trees corresponding to productions with one or more second daughters that have a left corner comprising one of the input symbols; obtaining productions from the input symbol index wherein the input symbol is a left corner of the second daughter; and after obtaining the productions having the input symbol as a left corner of the second daughter, accessing other indices to the productions.
8. The method of claim 7 wherein accessing other indices comprises: accessing a left-most daughter index to obtain productions based on their left-most daughter.
9. The method of claim 8 wherein accessing other indices comprises: accessing a mother index to obtain productions based on their mother.
10. The method of claim 9 and further comprising: accessing a list containing a completion of productions that are obtained by accessing the left-most daughter index and the mother index.
11. A computer-readable data structure indexing productions used in a left-corner chart parser which parses input text, the data structure comprising: one or more index portions implemented as a set of nested hash tables, the index portions comprising a first index portion indexing the productions first based on input symbols which are consistent with the productions; and a function which, when executed by the computer, traces the indexed productions to parse the input text.
12. The data structure of claim 11 wherein the first index portion indexes productions by input symbol based on which productions have the input symbol as a left corner of the second daughter.
13. The data structure of claim 12 wherein the index portions further comprise a second index portion indexing the productions based on a left-most daughter of the productions.
14. The data structure of claim 13 wherein the index portions further comprise a third index portion indexing the productions based on a mother of the productions.
15. The data structure of claim 14 wherein the index portions further comprise a fourth index portion enumerating the productions based on a remainder of the productions, other than the left-most daughter and the mother of the productions.